
Week 1 lecture notes

Reviewing introductory papers on interpretability and linguistic probes inside the black box of neural language models

The Barest Thought of an Intro to Neural Nets

  • A brief recent history of neural networks
    • Neural networks are mathematical objects — for doing computation
      • The most common types can be boiled down to a series of matrix multiplications (the forward pass)
      • Models can be hard-wired or learned, typically using gradient-based methods like backpropagation
      • General framing: The model will try to learn some mapping from the input (e.g., some vector representing the pixels of an image) to an output (i.e., a prediction, such as a single number, or a vector of numbers, such as class probabilities)
      • Simplest multi-layer models (e.g., the multilayer perceptron) perform two stages of matrix multiplication, with a nonlinear transformation applied to the intermediate (hidden) state
        • In some parts of the literature, especially older connectionist modeling papers, these transformations are sometimes called “activation functions”
        • Nonlinearities allow models to learn statistically interesting conjunctions of features, e.g., the XOR problem, which no purely linear model can solve (a minimal sketch appears after this list)
        • These conjunctions of features are also interactions, in the same sense as elsewhere in statistics: the effect of one variable depends on the value of another (e.g., the ab term in a + b + ab)
        • Linguistic structure is highly interactive — there are usually multiple sources of information that influence how we interpret language
    • Five years after Mikolov et al. (2013), the foundational word2vec paper, what was the state of research in NLP? A variety of models had appeared: ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), GPT-2 (Radford et al., 2019), and many, many more since
    • Movement from recurrent structures (e.g., RNNs and LSTMs) to attention-based computations using Transformer architectures (e.g., Vaswani et al., 2017)
    • Terminological note: I use “RNN” for models whose only mechanism for holding onto prior hidden states is simple recurrence; LSTMs (with forget gates and other gating) are a substantially different architecture; Transformers learn in a similar way to RNNs and LSTMs but have no recurrence, so predictions for all positions are computed in parallel
    • Neural network models have massively grown in size and number of parameters
  • Big questions about neural networks:
    • What is in the input and output of these models? That is, what encodings/representations are we using, and what assumptions does using those representations make?
    • What can and do the models learn from the data?
    • How are they generally trained?
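
A minimal sketch of the two-stage forward pass and the XOR point above. The weights are hand-picked for illustration (in practice they would be learned by backpropagation), and NumPy stands in for a real deep learning library:

```python
import numpy as np

def relu(x):
    # Nonlinear transformation ("activation function") applied to the hidden state
    return np.maximum(0.0, x)

# Hand-picked (not learned) weights for a tiny MLP that computes XOR
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # input (2) -> hidden (2)
b1 = np.array([0.0, -1.0])
W2 = np.array([[1.0],
               [-2.0]])       # hidden (2) -> output (1)
b2 = np.array([0.0])

def forward(x):
    h = relu(x @ W1 + b1)     # stage 1: matrix multiplication + nonlinearity
    return h @ W2 + b2        # stage 2: matrix multiplication

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, "->", forward(np.array(x, dtype=float)).item())  # 0, 1, 1, 0 = XOR
```

Without the nonlinearity, the two multiplications collapse into a single linear map, which cannot separate XOR; that is the sense in which nonlinearities buy conjunctions/interactions of features.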

Readings

Alishahi, A., Chrupała, G., & Linzen, T. (2019). Analyzing and interpreting neural networks for NLP: A report on the first BlackboxNLP workshop. Natural Language Engineering, 25(4), 543-557. https://www.cambridge.org/core/journals/natural-language-engineering/article/analyzing-and-interpreting-neural-networks-for-nlp-a-report-on-the-first-blackboxnlp-workshop/FAFF1B645BBF89FE400A521526AA65D4

Notes

  • “Octopus paper” - Bender and Koller (2020) (“Climbing towards NLU”)
  • Wide variety of reasons to want interpretable models
    • Stakeholders in a business
    • Accountability for legal reasons (e.g., California or the EU)
  • “Black box” → BlackboxNLP
  • Approaches outlined in BlackboxNLP
    • Developing annotated and specialized datasets to test models
    • Manipulation of the input to neural networks to test for importance of specific linguistic or demographic features
    • Developing diagnostic classifiers trained over intermediate representations from within a neural network model
    • Modifying neural network architectures to make them more explainable → simplify or distill the model into a smaller, simpler one
    • Designing training or testing datasets over simplified or formal languages
  • Input manipulation
    • Punctuation
    • Tokenization
    • Lemmatization
    • Chunking
    • Datasets
      • Diverse NLI (natural language inference) - the model must answer logical/semantic questions of varying linguistic complexity
      • GLUE - a benchmark suite of language-understanding tasks spanning different domains
      • Human reference points, e.g., children’s behavior in theory-of-mind experiments
      • Sentences of varying types of linguistic complexity (e.g., subject-verb agreement tests)
  • Developing diagnostic classifiers
    • Auxiliary task - some task other than the one the model was originally trained on (e.g., sentiment analysis)
    • Diagnostic classifiers - Is the presence or absence of a linguistic feature “in” the encoding/embedding/vector representation?
      • Can leverage the predictions of diagnostic classifiers to “nudge” a trained model in a more linguistic direction
      • Part-of-speech classifiers (e.g., NOUN, ADJ, VERB, PUNCT)
      • Subject-verb agreement (“The key(s) [to the cabinet(s)] is/are on the table”)
    • Nearest neighbors with a notion of conformity (Wallace, Feng, & Boyd-Graber, 2018): removing a feature (e.g., a word from a passage) can influence the overall representation
    • Probing (a minimal probing sketch appears after these notes)
    • Decoding
  • Modifying neural network architectures
  • Simplified or formal languages
    • Cross-linguistic transfer from a large corpus to a small one, to see how the originally learned representations are or are not preserved when training on a “new” language
    • Formal languages
      • Recognizing whether a string is valid in some formal system or not
      • a^n b^n languages require a pushdown automaton: something with memory that can keep count of the symbols seen so far; how well recurrent models manage this depends on a complex interaction between activation function and architecture (ReLU, GRU, LSTM, or a plain recurrent network); see the sketch after these notes
      • RNNs and LSTMs perform poorly at recognizing Dyck languages (matching opening and closing brackets) on strings longer than those they were trained on
  • Desirable future links
    • Evaluation - “When an explanation matches what a human would see as a reasonable basis of a particular decision, it does not necessarily follow that this was the basis”
    • Benchmarks
    • Neuroscientific alignment
      • Growing area in natural language processing!
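
A minimal sketch of a diagnostic (probing) classifier in the sense above: a linear classifier trained on frozen hidden representations to predict a linguistic label such as part of speech. The hidden states below are synthetic stand-ins for vectors you would extract from a real model (e.g., an LSTM or Transformer layer), and scikit-learn is an assumed dependency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for per-token hidden states extracted from a trained model.
# Two dimensions weakly encode a POS label so the probe has something to find.
rng = np.random.default_rng(0)
n_tokens, dim = 2000, 50
pos_labels = rng.integers(0, 3, size=n_tokens)     # 0=NOUN, 1=VERB, 2=ADJ
hidden_states = rng.normal(size=(n_tokens, dim))
hidden_states[:, 0] += pos_labels                   # inject a recoverable signal
hidden_states[:, 1] -= 0.5 * pos_labels

X_train, X_test, y_train, y_test = train_test_split(
    hidden_states, pos_labels, test_size=0.2, random_state=0)

# The diagnostic classifier itself: a simple linear probe over frozen representations
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy:   ", probe.score(X_test, y_test))
print("majority baseline:", np.bincount(y_test).max() / len(y_test))
```

Above-baseline probe accuracy shows that the feature is (linearly) decodable from the representation, not that the model actually uses it; that is exactly the evaluation caveat quoted above.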
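
And a small sketch of the formal-language testing idea: generate a^n b^n strings plus near-miss negatives, training on short strings and testing on strictly longer ones to separate genuine counting from memorization. The generator and split sizes are illustrative choices, not taken from the paper:

```python
import random

def is_anbn(s):
    # Membership test for a^n b^n (n >= 1): equal runs of a's then b's
    n = len(s) // 2
    return len(s) > 0 and len(s) % 2 == 0 and s == "a" * n + "b" * n

def make_example(max_n, positive):
    n = random.randint(1, max_n)
    if positive:
        return "a" * n + "b" * n, 1
    # Near-miss negative: perturb the count of b's so the string is invalid
    m = n + random.choice([-1, 1]) if n > 1 else n + 1
    return "a" * n + "b" * m, 0

random.seed(0)
# Train on short strings, test on strictly longer ones to probe length generalization
train = [make_example(max_n=10, positive=bool(i % 2)) for i in range(1000)]
test = [("a" * n + "b" * n, 1) for n in range(50, 60)]
print(train[:3])
print(all(is_anbn(s) == bool(label) for s, label in train))  # sanity check: True
```
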
Rogers, A., Kovaleva, O., & Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics, 8, 842-866. https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00349/96482/A-Primer-in-BERTology-What-We-Know-About-How-BERT

Notes

  • Syntactic knowledge
  • Define each of the following:
    • Linear versus hierarchical structure (e.g., “The cat the dog is sleeping next to is cute”)
    • Part-of-speech information (e.g., NOUN, ADJECTIVE, etc.)
    • Syntactic chunks (what sequences go together)
    • Roles (e.g., subject, object, arguments, adjuncts)
    • Named entity categories (memorization)
    • Pragmatic inference
    • Event knowledge
    • Syntactic relations (e.g., syntactic dependencies)
    • Subject-verb agreement
    • Anaphora

Madsen, A., Reddy, S., & Chandar, S. (2022). Post-hoc interpretability for neural NLP: A survey. ACM Computing Surveys (Just Accepted, June 2022). https://doi.org/10.1145/3546577

Notes

  • Motivations for interpretability
    • “incompleteness in the problem formalization”
    • Accountability
    • Safety
    • Ethics
    • Scientific understanding
  • Communication strategies in the interpretability literature
    • Local explanations (single observations; see the sketch at the end of these notes)
    • Global explanations (the whole model)
    • Class explanations (multiple observations from a single class)
  • Intrinsic interpretability - models that are interpretable by design
  • Post-hoc interpretability - methods applied after an NLP system has been trained, in order to interpret its behavior
  • Measures of interpretability
    • Application-grounded - evaluate the explanation in the real application, e.g., do doctors working in conjunction with an AI save more lives than doctors (or AIs!) alone?
    • Functionally-grounded - comparing against other post-hoc methods or an intrinsically interpretable model (e.g., a linear model)
    • Human-grounded - an estimate of the utility to people in general (vs. researcher intuitions), e.g., which model's explanation people judge to be the most accurate
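
A minimal sketch of a local, post-hoc explanation in the sense above: leave-one-out (occlusion) importance for a single input, in the spirit of the input-manipulation approaches from the BlackboxNLP notes. The scoring function here is a toy stand-in for a trained model's class probability:

```python
def leave_one_out_importance(tokens, score_fn):
    # Local, post-hoc explanation for one observation: how much does the model's
    # score drop (or rise) when each token is removed from the input?
    base = score_fn(tokens)
    return [(tokens[i], base - score_fn(tokens[:i] + tokens[i + 1:]))
            for i in range(len(tokens))]

# Toy stand-in scorer; in practice this would be a trained model's probability
# for the predicted class, queried with the manipulated input
POSITIVE = {"great", "love", "wonderful"}
NEGATIVE = {"boring", "awful"}

def toy_sentiment_score(tokens):
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens) / max(len(tokens), 1)

tokens = "I love this great but slightly boring movie".split()
for token, importance in leave_one_out_importance(tokens, toy_sentiment_score):
    print(f"{token:>10s}  {importance:+.3f}")
```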